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ABSTRACT 

Operational  cloud  forecasts  generated  by  the  Coupled  Ocean-Atmosphere  Mesoscale  Prediction  System 
(COAMPS)^  were  verified  over  the  eastern  Pacific  Ocean.  The  study  focused  on  the  accuracy  of  cloud 
forecasts  associated  with  extratropical  cyclone  and  convective  activity  during  the  late  winter  and  spring  of 
2007.  The  condensed  total  water  (liquid  and  solid)  path  was  used  as  a  proxy  for  cloud  cover  to  compare  the 
forecasts  with  retrievals  from  the  Geostationary  Operational  Environmental  Satellites  (GOES).  Analyses  of 
the  GOES  retrievals  indicate  that  deep  cloud  systems  were  generally  well  represented  during  daylight  hours. 
Thus,  the  bulk  of  the  verification  focused  on  the  general  aspects  of  quality  and  timing  of  these  deep  systems. 
Multiple  statistics  were  collected,  ranging  from  simple  correlations  and  histograms  to  more  sophisticated 
fuzzy  and  composite  statistics.  The  results  show  that  synoptic-scale  systems  were  generally  well  predicted  to 
at  least  two  days,  with  the  primary  error  being  an  overestimation  of  deep  cloud  occurrence.  Smaller  subsynoptic- 
scale  systems  were  subject  to  spatial  and  timing  biases  in  that  a  number  of  the  forecasts  were  lagged  by  3-6  h. 
Despite  the  bias,  60%-70%  of  the  forecasts  of  the  mesoscale  phenomena  displayed  useful  skill. 


1.  Introduction 

Many  current  technologies  are  increasingly  sensitive 
to  the  effects  of  cloud  cover.  Cirrus  often  interferes  with 
military  applications  that  depend  on  a  clear  line  of  sight 
(Norquist  1999).  Low  ceilings  impact  both  civilian  and 
military  aviation  (Carter  and  Glahn  1976),  and  in  this 
era  of  tight  schedules  even  minor  disruptions  have  sig¬ 
nificant  consequences.  Depending  on  the  application, 
depictions  of  the  cloud  state  are  desired  at  many  scales, 
and  accurate  forecasts  at  long  lead  times  are  highly 
desirable  for  planning  purposes.  However,  cloud  pre¬ 
diction  has  long  been  problematic. 

The  large  range  of  scales  (from  microscopic  to  syn¬ 
optic)  impacting  cloud  development  is  a  challenge  to 
even  the  highest-resolution  mesoscale  models.  How¬ 
ever,  bulk  cloud  parameterizations  and  microphysical 
schemes  have  recently  shown  the  ability  to  produce 


^  CO  AMPS  is  a  registered  trademark  of  the  Naval  Research 
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useful  cloud  forecasts.  Chaboureau  and  Pinty  (2006) 
compared  brightness  temperature  forecasts  from  the 
nonhydrostatic  Meso-NH  regional  model  with  Meteo- 
sat  Second  Generation  observations  over  Brazil.  With  a 
horizontal  grid  spacing  of  30  km,  the  model  successfully 
reproduced  diurnal  variations  in  convective  cloudiness 
as  well  as  lower-frequency  variations  produced  by 
changing  weather  regimes.  Other  comparisons  by  Cha¬ 
boureau  et  al.  (2002)  show  good  qualitative  agreement 
between  simulated  cloud  systems  and  retrieved  ice  and 
liquid  water  path  observations.  These  case  studies  were 
conducted  for  both  midlatitude  and  subtropical  cloud 
systems  using  horizontal  grid  resolutions  from  12  to  75 
km.  Additional  studies  by  Sohne  et  al.  (2006),  Li  et  al. 
(2005),  and  Chevallier  and  Kelly  (2002)  have  noted 
good  quantitative  agreement  between  cloud  forecasts 
and  the  accompanying  satellite  observations. 

As  model  resolution  increases,  cloud  forecasts  become 
increasingly  realistic.  Like  precipitation  forecasts,  this  real¬ 
ism  adds  value  but  does  not  necessarily  lead  to  higher  scores 
(Ebert  and  McBride  2000).  Bieringer  et  al.  (2006)  com¬ 
pared  3-km  fifth-generation  Pennsylvania  State  University- 
National  Center  for  Atmospheric  Research  Mesoscale 
Model  (MM5)  ceiling  forecasts  with  those  from  the  parent 
9-  and  27-km  domains,  as  well  as  forecasts  from  a  separate 
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20-km  Rapid  Update  Cycle  (RUC)  domain.  Categorical 
statistics  based  on  pointwise  comparisons  with  auto¬ 
mated  observations  showed  comparable  performance  at 
all  resolutions.  However,  spatial  variability  within  the  3- 
km  nest  was  a  potential  indicator  of  forecast  error.  Er¬ 
rors  were  significantly  reduced  by  averaging  the  3-km 
forecasts  over  a  54  km  X  54  km  region.  Sohne  et  al. 
(2006)  noted  that  high-resolution  forecasts  of  North 
African  convective  clouds  performed  significantly  bet¬ 
ter  in  a  spatially  averaged  sense  than  their  lower-reso¬ 
lution  counterparts.  Essentially  these  studies  indicate 
that  the  small-scale  variance  is  better  simulated  at  high 
resolution.  Even  if  the  forecast  weather  features  are  not 
identically  located  in  their  observed  positions,  these 
gains  can  be  very  useful  if  they  lead  to  improvements  in 
probability  forecasts  of  high  impact  weather. 

In  this  paper,  cloud  total  liquid  +  ice  condensed  wa¬ 
ter  path  (LWP)  forecasts  from  the  Coupled  Ocean- 
Atmosphere  Mesoscale  Prediction  System  (CO AMPS; 
Hodur  1997)  are  compared  with  Geostationary  Opera¬ 
tional  Environmental  Satellite  (GOES)  retrievals.  The 
main  goal  of  the  verification  is  to  determine  the  viability 
of  cloud  forecasts  over  the  eastern  Pacific,  where  vast 
areas  of  sparse  data  present  a  uniquely  challenging 
forecast  environment.  The  fact  that  CO  AMPS  produces 
useful  forecasts  in  this  region  attests  to  the  many  ad¬ 
vances  in  data  assimilation  and  physics  representation. 
Abundant  cloudiness  in  this  region  poses  a  unique 
challenge  to  verification  as  cloud  amount  forecasts  score 
deceptively  well  because  of  the  high  probability  of  a 
correct  random  forecast.  Upper-tropospheric  cloudi¬ 
ness  is  particularly  interesting  for  military  aviation,  but 
it  can  be  quite  difficult  to  verify  because  of  brightness 
temperature  ambiguities.  Thus,  the  focus  will  be  on  the 
deep,  cloud-producing  systems  typically  associated  with 
synoptic  fronts  and  subtropical  convection.  Since  these 
systems  are  responsible  for  a  large  percentage  of  upper- 
tropospheric  clouds,  tracking  the  forecast  accuracy  of 
these  systems  is  helpful.  Verification  statistics  of  many 
types  will  be  used  to  track  model  performance,  starting 
from  some  basic  statistics  and  progressing  to  more  so¬ 
phisticated  methods.  New  verification  techniques  have 
recently  shown  promise  in  accounting  for  the  small- 
scale  spatial  errors.  Fuzzy  methods  (Roberts  and  Lean 
2008)  and  the  composite  method  (Nachamkin  2004)  will 
be  employed  here  to  evaluate  forecast  performance 
with  respect  to  scale  as  well  as  event  occurrence. 

2.  Data 

a.  Atmospheric  forecast  model 

The  operational  nonhydrostatic  CO  AMPS  forecasts 
for  the  eastern  Pacific  conducted  at  the  Fleet  Numerical 


Meteorology  and  Oceanography  Center  (FNMOC)  were 
selected  for  this  study.  The  domain  setup  consists  of 
two  one-way  nested  grids  with  horizontal  spacings  of 
81  and  27  km  (Fig.  1),  and  30  vertical  sigma  levels  with 
the  lowest  at  10  m  AGL  and  the  highest  near  35  km. 
Forecasts  were  initialized  daily  at  0000,  0600,  1200, 
and  1800  UTC,  using  the  Naval  Research  Laboratory’s 
Atmospheric  Variational  Data  Assimilation  System 
(NAVDAS;  Daley  and  Barker  2001).  The  previous  6-h 
forecast  acted  as  a  first  guess.  Boundary  conditions  were 
supplied  from  the  Navy  Operational  Global  Atmo¬ 
spheric  Prediction  System  (NOGAPS;  Hogan  et  al. 
2002)  at  3-h  intervals  using  a  Davies  (1976)  scheme. 
Subgrid-scale  convection  was  parameterized  using  the 
Kain-Fritsch  scheme  (Kain  and  Fritsch  1993),  while  the 
explicit  microphysics  were  parameterized  using  a  mod¬ 
ified  Rutledge  and  Hobbs  (1983,  1984)  scheme  de¬ 
scribed  by  Schmidt  (2001).  These  modifications  include 
predictive  equations  for  graupel  and  drizzle,  as  well  as 
the  Meyers  et  al.  (1992)  ice  nucleation,  homogeneous 
freezing,  ice  multiplication  processes,  and  nonzero 
pristine  ice  fall  speeds.  For  this  study,  the  27-km  LWP 
forecasts  valid  from  0  to  48  h  were  collected  for  vali¬ 
dation  from  1  February  through  31  May  2007.  Notably, 
the  LWP  retrievals  (described  below)  were  restricted  to 
daylight  hours.  Thus,  simulations  initialized  at  0000  UTC 
were  used  to  verify  forecasts  with  lead  times  of  0-3, 
18-27,  and  42^8  h,  while  the  runs  initialized  at  1200  UTC 
were  used  to  verify  the  6-15-  and  30-39-h  forecasts.^ 
Some  fluctuations  occurred  in  the  statistical  time  series, 
but  the  general  trends  remained  consistent  for  both  ini¬ 
tialization  times. 

b.  Satellite  observations 

GOES  LWP  observations  were  retrieved  using  opti¬ 
mal  estimation  techniques  described  by  Mitrescu  et  al. 
(2006).  The  method  involves  prescribing  forward  radi¬ 
ative  and  microphysical  models  to  retrieve  cloud  prop¬ 
erties  from  passive  multispectral  observations.  During 
daylight  hours,  GOES  channels  1,  2,  and  4  are  used  to 
retrieve  cloud- top  temperature,  cloud-top  effective  ra¬ 
dius,  and  cloud  optical  depth.  LWP  is  in  turn  derived 
from  these  variables.  In  this  study,  only  daytime  re¬ 
trievals  were  used  as  they  have  proven  to  be  more  re¬ 
liable.  The  radiative  forward  model  follows  Nakajima 
and  King  (1990),  Miller  et  al.  (2000),  and  Heidinger 
(2003).  The  atmosphere,  cloud,  and  land/ocean  surface 
are  represented  as  three  separate  layers.  Cloud  optical 
depth  calculations  rely  on  reflectance  measurements,  and 
a  plane-parallel,  homogeneous  atmosphere  is  assumed. 


^  The  0600  and  1800  UTC  interim  runs  were  not  verified. 
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Fig.  1.  The  CO  AMPS  81-  and  27-km  eastern  Pacific  operational  domains  are  indicated  by  the 
thin  and  thick  rectangles,  respectively.  The  GOES  analysis  subdomain  extends  from  30°  to  50°N 
and  120°  to  160°W  as  indicated  by  the  dark  dotted  labeled  lines. 


Surface  albedo  is  prescribed  from  the  International 
Geosphere  Biosphere  Program  (IGBP)  dataset.  Cloud 
particles  are  assumed  to  follow  a  Gamma  size  dis¬ 
tribution  with  a  constant  width  throughout  the  cloud 
(Mitrescu  et  al.  2005).  Ice  particles  are  assumed  to  have 
a  spherical  cross  section  with  a  fixed  effective  density.  It 
should  be  noted  that  a  Marshall-Palmer  size  distribu¬ 
tion  is  assumed  for  the  liquid  species  in  the  CO  AMPS 
microphysics  scheme.  Since  mass  mixing  ratio  is  pre¬ 
dicted,  the  assumed  distribution  primarily  affects  the 
partitioning  of  the  particles  and  not  the  total  mass. 
Thus,  the  difference  in  the  distributions  is  not  expected 
to  produce  major  differences  in  the  integrated  LWP 
calculations. 

The  native  footprint  of  the  satellite  retrieval  was  ap¬ 
proximately  4  km.  For  comparison  with  the  CO  AMPS 
output,  these  data  were  linearly  averaged  onto  the  27-km 
forecast  grid.  Representativeness  errors  were  alleviated 
as  the  satellite  data  themselves  represent  averages  over 
an  area.  To  minimize  errors  associated  with  land  albedo, 
low  zenith  angle,  and  model  boundary  effects,  a  data 
window  was  defined  from  160°  to  120°W  longitude  and 
30°  to  50°N  latitude  (Fig.  1).  All  verification  was  con¬ 
ducted  within  the  area  defined  by  this  window. 


Nakajima  and  King  (1990)  and  Miller  et  al.  (2000) 
estimate  optical  depth  and  effective  radius  errors  to  be 
as  high  as  50%,  thus  LWP  may  also  contain  errors  up 
to  a  factor  of  2.  For  this  study,  the  LWP  output  was 
carefully  studied  visually  for  some  indication  of  the 
nature  of  these  errors.  At  low  solar  zenith  angles  LWP 
was  generally  underestimated,  and  occasionally  large 
discrepancies  occurred  when  shadows  developed  in 
regions  of  textured  clouds.  Given  the  observational 
ambiguities  it  was  decided  to  present  the  statistics  in  a 
bias-corrected  format.  Though  quantitative  bias  and 
RMS  errors  of  the  forecasts  will  remain  unknown,  useful 
statistics  can  still  be  derived  pertaining  to  the  forecasts  of 
the  general  positions  of  large  cloud  systems. 

The  bias  was  removed  by  comparing  LWP  measure¬ 
ments  with  the  forecasts  over  bimonthly  intervals  in  a 
series  of  scatter  diagrams.  Direct  point-to-point  com¬ 
parisons  were  of  little  use  because  of  the  high  variability 
and  the  presence  of  phase  errors.  Instead,  the  phase 
errors  were  statistically  removed  by  sorting  the  fore¬ 
casts  and  observations  by  magnitude  at  each  realization 
and  deriving  regression  relationships  from  the  sorted 
distributions  (Fig.  2).  The  resulting  average  LWP  dis¬ 
tributions  exhibited  considerable  variability,  especially 
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Fig.  2.  Scatter  diagrams  of  the  magnitude-sorted  LWP  forecasts  and  observations  for  Feb¬ 
ruary  and  March  2007  for  the  (a)  21-  and  (b)  24-h  forecasts.  Third-order  regression  calculations 
are  represented  by  solid  curves  while  x  =  y  is  represented  by  the  light  dotted  line. 


at  large  values.  During  the  middle  portions  of  the  day 
(Fig.  2a),  the  scatter  was  relatively  uniform  about  the 
X  =  y  line.  However,  by  late  afternoon  (Fig.  2b)  the 
zenith  angle  bias  manifested  itself  as  a  quadratic  error 
that  was  generally  correctable  using  regression  techniques. 
To  minimize  the  effects  of  the  bias,  the  majority  of  the 


statistics  discussed  herein  will  be  limited  to  periods 
when  most  of  the  grid  experienced  relatively  high  zenith 
angle,  namely,  1800-2100  UTC.  As  Fig.  2a  indicates,  the 
systematic  bias  was  relatively  small  during  these  times. 

Given  the  observational  data  caveats,  it  is  not  sur¬ 
prising  that  Chevallier  and  Kelly  (2002)  and  others  note 


October  2009 


NACHAMKIN  ET  AL. 


3489 


that  brightness  temperature  comparisons  are  a  more 
consistent  means  of  comparing  the  model  with  the  ob¬ 
servations  as  retrieval  errors  are  not  incorporated. 
However,  the  derived  products  from  both  the  model 
and  the  satellite  are  regularly  used  for  day-to-day  op¬ 
erations.  Most  user  applications  rely  on  cloud  height 
and  depth  information  that  brightness  temperature  alone 
does  not  provide.  The  compromise  entailed  by  present¬ 
ing  the  results  in  a  user-oriented  space  was  worth  the 
additional  error,  especially  if  both  products  are  widely 
used.  The  results  should  thus  be  viewed  as  a  general 
consistency  check  between  the  predictions  and  obser¬ 
vations  of  large-scale  deep  cloud  areas. 

3.  Results 

a.  Point-to-point  comparisons 

The  21-h  forecast  from  0000  UTC  1  February  (Fig.  3) 
graphically  demonstrates  the  difficulties  associated  with 
point-to-point  verification  of  high-gradient  fields.  The 
most  prominent  features  include  a  north-south  frontal 
band  through  the  central  portions  of  the  analysis  grid 
(near  140°W),  along  with  a  region  of  scattered  post¬ 
frontal  clouds  west  of  the  front  (near  153°W).  While 
the  general  location  of  the  cloud  features  was  correctly 
predicted,  the  scatter  diagram  inset  in  Fig.  3b  shows 
little  correlation  between  the  two  LWP  fields.  A  closer 
look  reveals  that  the  northern  portions  of  the  frontal 
band  lag  to  the  west  of  the  observations,  while  the 
southern  portions  are  too  intense  and  narrow.  The  bro¬ 
ken  nature  of  the  cloudiness  in  the  central  and  southern 
portions  of  the  band  is  also  not  well  simulated.  Cloud 
coverage  to  the  west  of  the  front  is  underrepresented, 
and  the  leading  edge  of  the  cloud  field  is  1°-2°W  of  its 
observed  position.  Although  these  errors  are  relatively 
small  in  scale,  high  LWP  gradients  significantly  reduce 
the  correlations.  The  traditional  verification  scores 
presented  in  this  section  measure  the  direct  field  cor¬ 
respondence,  providing  little  allowance  for  small-scale 
errors  in  high-variance  fields.  Thus,  the  scores  should  be 
interpreted  with  caution. 

The  4-month  average  correlations  (Fig.  4)  reflect 
some  of  the  features  mentioned  in  the  example  above. 
While  many  of  the  forecasts  appeared  to  be  viable  on 
visual  inspection,  the  correlations  were  generally  low. 
Correlation  coefficients  start  below  0.5  at  the  analysis 
time^  and  slowly  decrease  through  the  forecast.  High 
variability  at  small  scales  essentially  acts  as  noise  (Fig.  3), 
reducing  the  overall  strength  of  the  correlations.  The 
slow  degradation  of  the  correlations  with  time  primarily 


^  LWP  values  were  not  assimilated. 


reflects  medium-  to  large-scale  errors.  Daily  correla¬ 
tions  at  the  synoptic  scale  (discussed  later)  were  on  the 
order  of  0.8,  and  these  large-scale  correlations  also 
dropped  very  slowly  with  lead  time.  The  choppy  nature 
of  the  descent  is  likely  due  to  the  variations  in  LWP 
magnitude  with  sun  angle.  GOES  persistence  correla¬ 
tions  were  calculated  by  comparing  the  satellite  obser¬ 
vations  valid  at  the  analysis  time  with  the  observations 
over  the  ensuing  48-h  period.  The  persistence  correla¬ 
tions  drop  rapidly  with  time,  reflecting  the  complex 
structure  and  rapid  evolution  of  the  cloud  field.  GOES 
LWP  persistence  drops  below  CO  AMPS  after  3  h,  while 
the  infrared  (IR)  cloud-top  persistence  forecast  decays 
more  slowly.  The  IR  field  is  probably  a  better  overall 
representation  of  the  satellite  persistence  forecast  decay 
rate  as  this  field  is  not  sensitive  to  solar  zenith  angle. 
Note  also  that  LWP  persistence  was  only  calculated 
against  forecasts  initialized  at  0000  UTC  because  dark¬ 
ness  prevented  the  collection  of  LWP  observations 
between  0600  and  1500  UTC. 

Perhaps  the  most  interesting  aspect  of  Fig.  4  is  re¬ 
flected  in  the  behavior  of  the  lagged  forecast  correla¬ 
tions.  These  correlations  were  generated  by  comparing 
current  observations  with  forecasts  valid  3  h  later  than 
(fm3)  and  3  h  earlier  than  (fp3)  the  observation  time. 
The  fp3  and  on  time  (fOO)  forecast  correlations  were 
about  the  same  magnitude,  while  the  fm3  correlations 
were  considerably  lower.  Tests  using  a  t  distribution 
indicate  that  the  differences  between  the  fOO  and  fm3 
correlations  are  significant  at  the  99%  confidence  level 
for  all  lead  times  prior  to  42  h.  Tests  with  6-h  lags  (not 
shown)  resulted  in  correlations  at  or  below  the  fm3 
levels.  These  results  indicate  that  the  model  tends  to 
be  too  slow  with  the  progression  of  weather  systems 
through  the  region.  The  near  equality  of  the  fp3  and 
fOO  correlations  suggests  a  peak  somewhere  between 
them,  reflecting  an  average  lag  error  in  the  0-3-h  range. 
The  phase  error  is  not  universal,  though,  as  indicated 
by  the  example  in  Fig.  3.  Note  the  northern  portions 
of  the  frontal  band  (located  near  140°W)  are  lagged  in 
the  forecast,  north  of  about  40°N,  while  farther  south 
the  frontal  position  is  better  simulated.  Analysts  at 
FNMOC  have  relayed  similar  anecdotal  accounts  of 
3-6-h  time  lags  in  the  precipitation  forecasts  (J.  Lerner, 
FNMOC  models  and  data  section,  2007,  personal  com¬ 
munication). 

Although  much  of  the  bias  was  removed  from  the 
data,  average  LWP  histograms  (Fig.  5)  still  reveal  some 
interesting  trends.  The  majority  of  the  observed  LWP 
values  fell  into  the  lowest  threshold  categories,  with 
values  below  100  g  m“^  accounting  for  about  60%  of  the 
distribution.  Most  of  the  LWP  values  below  200  g 
derive  from  cirrus  or  boundary  layer  stratus  clouds  that 
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Fig.  3.  (a)  The  CO  AMPS  21-h  LWP  forecast  and  (b)  the  verifying  GOES  analysis.  The  shading  is  in  grams  per 
meter  squared  as  represented  by  the  scale.  The  scatter  diagram  inset  in  (b)  represents  a  point-to-point  comparison 
between  the  predicted  and  observed  fields. 


often  cover  vast  regions  of  the  Pacific.  Deeper  precipi¬ 
tating  clouds  are  generally  associated  with  values  above 
500  g  m~^.  In  the  model,  precipitating  convective  clouds 
exhibited  values  over  3000  g  m“^,  though  the  corre¬ 
sponding  observed  values  rarely  exceeded  2000  g  m~^. 
These  clouds  are  highly  three-dimensional  and  thus 
violate  the  plane-parallel  and  vertical  homogeneity  as¬ 


sumptions  in  the  retrievals.  Multidirectional  scattering 
of  outgoing  radiation  likely  results  in  underestimates  of 
LWP.  A  simple  hydrostatic  calculation  reveals  that  even 
a  modest  average  liquid  mixing  ratio  of  0.5  g  kg“^  over  a 
700-hPa  depth  from  1000  to  300  hPa  results  in  an  LWP 
of  over  3500  g  m~^,  which  is  well  above  any  of  the  ob¬ 
served  values. 


October  2009 


NACHAMKIN  ET  AL. 


3491 


Forecast  hour 


Fig.  4.  Correlation  coefficients  between  the  predicted  and  observed  variables  with  respect  to 
forecast  hour.  GOES  infrared  cloud-top  height  (pers  cth)  and  LWP  (pers  Iwp)  persistence 
forecasts  are  represented  by  the  thick  solid  and  dashed  lines.  The  CO  AMPS  forecasts  valid  at 
the  observation  time  (fOO),  3  h  earlier  than  the  observation  time  (fm3),  and  3  h  later  than  the 
observation  time  (fp3)  are  indicated  by  the  light  dotted,  dashed-dotted,  and  short  dashed  lines, 
respectively. 


The  observed  and  predicted  distributions  deviate  sig¬ 
nificantly  from  one  another  at  values  below  100  g 
The  regression  curves  (Fig.  2)  did  not  remove  these 
biases  as  the  absolute  differences  in  LWP  were  small  and 
highly  compressed  at  the  low  end  of  the  scale.  Correc¬ 
tions  were  primarily  weighted  by  the  very  large  biases 
at  the  high  end  of  the  LWP  distribution.  The  low-end 
biases  likely  result  from  deficiencies  in  subgrid-scale 
cloud  prediction.  To  demonstrate  this,  the  observed 
values  were  reinterpolated  to  the  model  grid  such  that  all 
grid  areas  with  less  than  50%  cloud  coverage  were  au¬ 
tomatically  set  to  a  value  of  LWP  =  0  g  m“^  (clear).  All 
grid  points  covering  observed  areas  of  scattered  or  partly 
cloudy  conditions  were  thus  eliminated.  The  resulting 
distribution,  represented  by  the  bold  horizontal  lines  in 
Fig.  5,  is  much  closer  to  the  forecasts,  especially  for  clear 
skies.  A  notable  deviation  from  this  trend  was  the  ten¬ 
dency  for  the  model  to  produce  too  much  thin  cirrus. 
Much  of  this  overproduction  occurred  in  areas  that  were 
otherwise  covered  by  low  stratus,  thus  leaving  LWP 
statistics  relatively  unaffected. 

The  Hanssen-Kuipers  discriminant  (HK)  and  the 
equitable  threat  score  (ETS)  are  presented  in  Fig.  6  for 
completeness,  as  these  values  are  often  presented  in 
the  verification  literature.  The  HK  discriminant  ranges 
from  —1  to  1,  with  1  being  a  perfect  score,  0  indicating 
no  skill,  and  —1  indicating  a  perfect  negative  correla¬ 
tion.  Ebert  et  al.  (2004)  note  that  HK  is  the  false-alarm 


rate  subtracted  from  the  probability  of  detection.  It 
reflects  the  ability  of  the  forecast  to  discern  between 
cloudy  and  clear  areas.  The  ETS  measures  the  number 
of  correct  forecasts  in  proportion  to  the  total  number  of 
forecasts  and  observations  of  a  given  event,  adjusted  for 
the  probability  of  a  correct  random  forecast.  Since  cloud 
areas  were  often  extensive,  the  random  adjustment 
factor  was  sometimes  as  large  as  the  ETS  itself  for  lower 
LWP  thresholds. 

For  consistency,  the  scores  were  calculated  only  at  the 
time  of  the  maximum  average  zenith  angle  (2100  UTC) 
using  forecasts  initialized  at  both  0000  and  1200  UTC. 
Since  the  scores  are  threshold  based,  a  value  of  500  g 
was  chosen  to  represent  the  deeper,  precipitating  sys¬ 
tems.  These  systems  reflect  the  majority  of  the  well- 
defined  cloud  entities  that  tracked  across  the  region. 
Lower  thresholds  produced  artificially  high  scores  as 
both  cirrus  and  stratus  clouds  could  share  the  same  LWP 
value.  Given  the  poor  point-to-point  correlations  the 
scores  were  generally  low.  However,  both  scores  indi¬ 
cated  some  skill  through  much  of  the  forecast,  with  little 
reduction  with  increasing  lead  time. 

The  sensitivity  of  the  pointwise  scores  to  small-scale 
errors  is  mitigated  by  considering  averages  over  larger 
areas.  For  example,  the  general  ability  for  the  model  to 
discern  between  synoptically  disturbed  and  quiescent 
conditions  is  an  important  indication  of  overall  perfor¬ 
mance,  especially  in  data-sparse  regions.  Figure  7  depicts 
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Fig.  5.  The  bias-corrected  21-h  LWP  forecasts  (light  bars)  and  observations  (dark  bars).  Values  along  the 
vertical  axis  represent  the  average  coverage  over  the  analysis  grid  during  the  4-month  period.  Horizontal  axis 
values  represent  the  lower  bound  of  each  bin.  Standard  deviations  are  depicted  by  the  vertical  solid  lines.  Thick 
horizontal  lines  represent  the  observations  when  all  grid  squares  containing  less  than  50%  cloud  coverage  are 
considered  as  clear. 


the  daily  total  coverage  of  LWP  above  500  g  as  a 
fraction  of  the  entire  analysis  region.  The  model  was 
generally  able  to  predict  periods  of  disturbed  weather, 
though  coverage  errors  on  the  order  of  5%  were  not 
uncommon  on  a  given  day.  The  correlation  coefficients 
at  the  9-,  21-,  33-,  and  48-h  lead  times  were  0.82,  0.87, 


0.80,  and  0.80,  respectively.  These  values  were  consid¬ 
erably  higher  than  the  point-to-point  correlations  in  Fig.  4, 
reflecting  the  improved  performance  at  large  scales.  The 
rate  of  decrease  with  time  was  also  slower  than  that  in 
Fig.  4,  again  due  to  the  large  scale.  The  reduction  in  cor¬ 
relation  with  lead  time  was  primarily  due  to  increasing 
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Fig.  6.  The  4-month-average  HK  discriminant  (solid  line)  and  ETS  (dashed  line)  as  a  function 

of  forecast  lead  time. 
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Fig.  7.  Bias-corrected  21-h  forecasts  and  observations  of  LWP  values  >500  g  m~^  as  a 
function  of  total  coverage  over  the  satellite  analysis  grid.  The  regression  relation  is  represented 
by  the  solid  line  while  x  =  y  is  represented  by  the  dotted  line. 


random  scatter,  and  the  slope  of  the  regression  lines 
remained  relatively  close  to  unity. 

b.  Multiscale  statistics 

A  number  of  new  verification  methods  have  recently 
been  developed  to  sample  forecast  accuracy  over  a  range 
of  length  scales.  Verifying  at  multiple  scales  provides  a 
means  to  gauge  the  spatial  forecast  error.  If  the  errors 
are  small  in  scale  the  statistics  will  rapidly  improve  with 
increasing  sample  area.  Ebert  (2008)  reviews  a  number 
of  these  methods,  some  examples  include  upscaling 
(Zepeda- Arce  et  al.  2000),  wavelet  decomposition 
(Casati  et  al.  2004),  and  the  fuzzy  neighborhood  method 
(Roberts  and  Lean  2008).  Since  clouds  often  occur  as 
fractional  or  amorphous  elements  that  are  difficult  to 
characterize,  the  fuzzy  neighborhood  method  was  par¬ 
ticularly  appealing  as  accuracy  is  expressed  in  terms  of 
event  frequency  (Roberts  and  Lean  2008;  Ebert  2008). 
The  method  was  also  very  simple  to  code  and  it  ran  very 
efficiently  within  the  existing  software. 

The  fuzzy  neighborhood  method  works  by  calculating 
verification  metrics  over  a  series  of  squares,  or  neigh¬ 
borhoods,  of  increasing  size  centered  at  each  grid  point. 
Larger  neighborhoods  allow  for  displaced  forecast 
values  to  be  counted  as  correct.  We  follow  Roberts  and 
Lean  (2008)  in  using  the  fractions  skill  score  (ESS)  as  a 
skill  metric.  The  ESS  is  defined  as 


MSE, 


^  mse, 


(«) 


(1) 


where  MSE(„)  refers  to  the  mean-square  error  over  a 
neighborhood  of  length  n  as  given  by 


Ny 

-■ 
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The  MSE  is  calculated  from  the  binary  field  I  created  by 
imposing  a  threshold  on  the  data  to  be  verified.  Here,  all 
LWP  points  exceeding  500  g  m~^  are  assigned  a  value  of 
1=1,  with  7=0  assigned  to  all  other  points.  The  MSE  is 
summed  for  each  forecast  over  the  entire  analysis  grid 
consisting  of  by  Ny  points,  where  X  Ny  =  9971. 
The  quantities  and  represent  the  sum  over 
each  neighborhood  of  the  observed  and  predicted  com¬ 
ponents  of  the  binary  field  centered  at  each  i,  j  grid 
point: 
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Iwp  >=  500  g/m**2 

Fig.  8.  The  variation  of  the  FSS  with  neighborhood  length  for  the  9-,  21-,  33-,  and  45-h 
forecast  lead  times.  The  FSS  associated  with  the  21 -h  GOES  LWP  persistence  forecast  (p21)  is 
indicated  by  the  dashed-dotted  line.  The  uniform  and  random  FSS  are  displayed  as  horizontal 
dotted  and  dashed  horizontal  lines,  respectively. 
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The  MSE  is  zero  for  all  neighborhoods  where  equal 
numbers  of  observed  and  predicted  points  exceed  the 
threshold  value,  regardless  of  the  position  of  the  points 
within  each  neighborhood.  To  mitigate  fluctuations 
in  the  MSE  related  to  coverage,  as  well  as  to  ensure 
that  the  FSS  retains  a  value  between  0  (no  skill)  and  1 
(perfect  skill)  Roberts  and  Lean  normalize  the  score 
with  a  reference  MSE^^^  representing  the  largest  pos¬ 
sible  MSE  from  a  given 'set  of  observed  and  predicted 
fractions: 


N.  w 
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The  above  system  was  used  to  calculate  the  FSS  over 
scales  ranging  from  a  single  grid  point  (27  km)  to  a 
41  X  41  point  square  (1107  km).  Again  to  minimize  bias 
issues,  observations  from  2100  UTC  were  used  exclu¬ 
sively.  The  resulting  curves  in  Fig.  8  show  the  progres¬ 
sion  of  the  FSS  over  the  range  of  scales  listed  above. 
Point-scale  FSS  values  are  quite  low,  but  the  FSS  rap¬ 
idly  increases  with  increasing  neighborhood  scale.  This 
behavior  is  consistent  with  the  example  in  Fig.  3  re¬ 


flecting  that  the  general  patterns  in  the  forecast  are 
better  than  the  point-scale  values. 

The  neighborhood  scale  with  the  optimum  combina¬ 
tion  of  forecast  accuracy  with  minimal  loss  in  precision 
due  to  averaging  is  largely  up  to  the  individual  user. 
Since  the  neighborhood  method  is  relatively  new  few 
studies  exist  to  offer  comparative  scores  from  other 
cloud  forecasts.  Sohne  et  al.  (2006)  reported  FSS  values 
near  0.8  at  length  scales  of  150  km  for  a  set  of  6-h 
brightness  temperature  forecasts  of  anvil  cloudiness 
for  a  single  flash  flood  case.  This  case  was  clearly  well 
simulated  considering  the  horizontal  grid  spacing  was 
50  km  in  some  of  their  sensitivity  studies.  Their  score  is 
likely  on  the  high  end  of  the  forecast  quality  scale,  es¬ 
pecially  for  convection.  The  eastern  Pacific  FSS  in  our 
study  is  somewhat  lower,  though  probably  more  repre¬ 
sentative  because  of  the  large  sample  size.  Murphy  and 
Epstein  (1989)  also  noted  that  the  FSS  is  sensitive  to 
the  grid  coverage  of  the  field  subject  to  the  threshold. 
Large  coverage  tends  to  score  higher,  thus  the  grid  size, 
threshold,  and  event  climatology  must  be  similar  for 
adequate  comparison. 

Roberts  and  Lean  (2008)  derived  two  simple  mea¬ 
sures  of  quality  based  on  the  FSS  for  random  and  uni¬ 
form  forecasts.  The  FSS  for  random  forecasts  with  the 
same  fractional  coverage  over  the  domain  as  the  event 
to  be  verified  is  simply  equal  to  the  coverage,  which  was 
about  0.12  in  this  case  (Fig.  8).  Uniform  forecasts  were 
defined  as  a  forecast  at  the  gridpoint  scale  (n  =  1)  with 
the  fraction/probability  equal  to  the  average  fractional 
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coverage  of  the  event  to  be  verified.  The  uniform  FSS 
is  defined  as  being  half-way  between  the  random  and 
perfect  skill,  which  in  this  case  was  0.56  (Fig.  8).  Uniform 
forecasts  possess  a  reasonable  amount  of  skill,  but  have 
zero  precision.  Roberts  and  Lean  contend  that  FSS 
scores  above  the  uniform  score  represent  the  smallest 
scale  over  which  the  forecast  contains  useful  informa¬ 
tion.  Applying  this  standard  here  indicates  that  length 
scales  between  135  and  189  km  (5-7  grid  points)  should 
provide  useful  forecasts.  This  result  is  not  unreason¬ 
able  given  well-known  constraints  on  model  resolution. 
Additional  tests  with  the  GOES  persistence  forecasts 
showed  a  large  degradation  by  21  h  (Fig.  8).  The  per¬ 
sistence  FSS  remained  below  the  uniform  skill  threshold 
for  neighborhoods  below  900  km  on  a  side.  Unfortu¬ 
nately,  LWP  persistence  was  not  available  for  most  other 
lead  times  prior  to  21  h  because  of  daylight  constraints. 

c.  Composite  statistics 

The  question  of  forecast  utility  and  optimal  length 
scale  can  be  further  investigated  using  composite  meth¬ 
ods  developed  by  Nachamkin  (2004).  Composites  pro¬ 
vide  visual  and  quantitative  feedback  on  the  average 
state  of  the  forecasts  and  observations  when  specific 
events  are  expected.  The  basic  premise  of  the  composite 
method  involves  identifying  events  of  interest  in  both 
the  forecasts  and  the  observations  and  creating  com¬ 
posites  based  on  the  event  occurrence.  Accurate  fore¬ 
casts  result  in  strong  similarities  between  the  observed 
and  predicted  spatial  distributions,  while  systematic 
spatial  errors  manifest  themselves  as  displacements  in 
the  composited  fields.  For  this  study  moderate-  to  small¬ 
sized  events  were  composited  in  an  effort  to  gauge  the 
mesoscale  performance.  All  deep  cloud  events  con¬ 
taining  between  100  and  600  contiguous  grid  points  with 
LWP  values  equaling  or  exceeding  500  g  m“^  were 
composited.  These  typically  represent  convective  cloud 
clusters,  small  developing  cyclones,  or  cold  cutoff  por¬ 
tions  of  mature  cyclones.  The  events  in  this  composite 
reside  in  the  low  to  midportions  of  the  spatial  range  that 
the  neighborhood  method  indicated  would  be  viable 
forecasts  (Fig.  8). 

The  composite  contingent  on  the  occurrence  of  a 
predicted  event  in  the  21-h  forecasts  is  displayed  in 
Fig.  9.  The  sample  comprised  86  events  distributed  over 
the  analysis  region.  For  a  given  event,  all  points  on  the 
41  X  41  point  (1107  km  X  1107  km)  sample  grid  that  fell 
outside  the  satellite  analysis  region  were  ignored  as  bad 
data.  Thus,  the  gradient  in  the  number  count  in  the 
northern  portions  of  the  composite  (Fig.  9b)  indicates 
that  a  number  of  events  occurred  near  the  northern 
boundary  of  the  satellite  analysis  region  (50°N).  This 
result  reflects  the  prevalence  of  cyclone  activity  north  of 
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Fig.  9.  Statistics  from  the  composite  based  on  the  existence  of  a 
predicted  mesoscale  event  in  the  21-h  forecasts,  (a)  The  average 
observed  (contoured)  and  predicted  (shaded)  LWP  values  are  in 
grams  per  meter  squared,  (b)  The  number  of  valid  data  points  is 
contoured  and  the  region  of  LWP  differences  between  the  ob¬ 
served  and  predicted  fields  >50  g  m~^  at  the  95%  confidence  level 
is  shaded,  (c)  The  observed  (contoured)  and  predicted  (shaded) 
occurrence  frequency  of  LWP  values  >500  g  m“^. 
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40°N.  A  distinct  shift  was  evident  in  both  the  average 
LWP  (Fig.  9a),  and  event  frequency  distributions  (Fig.  9c) 
suggesting  that  the  model  forecasts  were  too  far  south 
and  west  of  the  observations.  Given  that  the  prevailing 
flow  is  often  southwesterly  during  disturbed  weather  the 
temporal  lag  in  Fig.  4  is  consistent  with  this  spatial  phase 
shift.  The  statistical  significance  of  the  spatial  shift  was 
tested  using  a  t  distribution  with  a  null  hypothesis  that 
the  observed  and  predicted  average  LWP  differed  by  at 
least  50  g  m“^.  The  region  of  statistical  significance  at 
the  95%  confidence  level,  indicated  by  the  shading  in 
Fig.  10b,  covers  the  center  of  the  predicted  events,  but 
does  not  extend  northeastward  to  the  corresponding 
lobe  of  increased  observed  LWP  (Fig.  9a).  Reduced 
number  counts  and  high  standard  deviations  lowered 
the  statistical  confidence  in  this  area. 

The  composite  of  the  forecasts  contingent  on  the 
occurrence  of  an  observed  event  (Fig.  10)  depicts  a  less 
distinct,  though  still  apparent  spatial  shift.  The  region  of 
statistical  significance  (Fig.  10b)  extends  farther  north¬ 
east,  due  in  part  to  increased  number  counts.  A  total  of 
124  events  were  sampled,  indicating  more  small-  to 
midsized  events  existed  in  the  observations  than  the 
forecasts.  The  disparity  likely  results  from  the  enhanced 
variability  in  the  observations  compared  to  the  forecasts 
(Skamarock  2004).  Although  the  observations  were  av¬ 
eraged  to  the  model  grid,  a  given  cyclone  often  con¬ 
sisted  of  several  closely  associated  mesoscale  cloud 
areas.  Note  the  broken  nature  of  the  observed  frontal 
cloudiness  in  Fig.  3  compared  to  the  forecast.  As  such 
the  observation-based  composite  likely  included  a  num¬ 
ber  of  essentially  synoptic-scale  events.  Cloud  coverage 
constraints  could  serve  to  filter  the  observed  events, 
though  such  efforts  were  not  attempted  here  as  both 
composites  indicate  the  same  basic  trends. 

A  greater  understanding  of  the  nature  of  the  forecast 
errors  as  well  as  the  verification  results  can  be  achieved 
by  combining  concepts  used  in  both  the  neighborhood 
and  composite  methods.  Although  the  FSS  is  very  use¬ 
ful  for  comparing  forecasts  with  one  another,  applying 
the  FSS  to  obtain  an  optimal  length  scale  is  somewhat 
abstract.  How  do  the  curves  in  Fig.  8  translate  to  fore¬ 
cast  performance?  In  an  attempt  to  address  this  issue, 
events  in  the  composite  sample  above  were  recom¬ 
posited  using  criteria  based  roughly  on  the  FSS.  For  each 
event,  a  single  sample  FSS  was  defined  using  Eq.  (1), 
where  MSE(„)  was  the  binary  MSE  over  the  n  X  n  re¬ 
gion  centered  on  the  event.  The  quantity  MSE(^„)  was 
defined  individually  for  each  event  sample  using  Eq.  (5). 
Event  forecasts  were  grouped  into  high-  and  low-quality 
categories  based  on  whether  the  single  sample  FSS  met 
or  exceed  the  average  FSS  at  scale  n  as  shown  in  Fig.  8. 
While  the  comparison  is  only  an  approximate  analogy,  it 


Fig.  10.  As  in  Fig.  9,  but  for  the  composite  based  on  the  occurrence 
of  an  observed  mesoscale  event. 
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Fig.  11.  The  percentage  of  correct  mesoscale  event  forecasts  (hits)  as  defined  by  the  FSS 
criteria  for  the  5 -(solid  lines)  and  25-point  (dashed  lines)  neighborhood  length.  Solid  triangles 
represent  the  values  derived  from  the  observation-based  composite  while  open  squares  rep¬ 
resent  those  from  the  forecast-based  composite. 


does  provide  insight  regarding  the  ability  of  the  FSS  to 
assess  error  as  well  as  the  nature  of  the  error  associated 
with  the  lower-quality  forecasts. 

Two  FSS  criteria  at  opposite  ends  of  the  spectrum, 
squares  with  sides  of  n  =  5  and  25  grid  points  (135  and 
675  km),  were  applied  to  the  mesoscale  events  depicted 
in  Figs.  9  and  10.  For  consistency,  the  FSS  for  the  21-h 
forecasts  was  applied  to  define  the  forecast  quality  at  all 
lead  times.  Based  on  the  calculations  in  Fig.  8  the 
critical  FSS  values  for  the  5  and  25  point  areas  were  0.57 
and  0.79,  respectively.  As  might  be  expected,  the  five- 
point  criteria  were  quite  restrictive  (Fig.  11).  The  small 
box  size  required  a  large  portion  of  very  close  or  col¬ 
located  positive  forecasts  and  observations  to  exceed  a 
given  FSS.  Only  40%-50%  of  the  events  in  the  fore- 
cast-based  composite  met  or  exceeded  the  critical  FSS, 
while  a  somewhat  greater  55%-65%  of  the  observation- 
based  events  were  accepted.  Relaxing  the  criteria  to 
25  points  pushed  the  acceptance  rate  above  70%  for 
both  composites  during  the  early  portions  of  the  fore¬ 
cast,  with  decreased  rates  at  longer  lead  times.  The  gains 
in  the  forecast-based  composite  were  markedly  higher 
than  those  in  the  observation-based  composite,  re¬ 
flecting  the  smaller  phase  shift  as  well  as  the  likelihood 
that  some  larger-scale  events  were  incorporated  in  the 
sample. 

The  effect  of  the  scale-based  acceptance  criteria  can 
be  illustrated  by  comparing  composites  of  the  forecasts 
that  were  defined  to  be  of  high  and  low  quality.  Note  in 
the  figures  that  the  high-quality  forecasts  are  loosely 
referred  to  as  “hits.”  The  forecast-based  event  fre¬ 
quency  distributions  for  the  21-h  forecasts  show  strong 
agreement  between  the  observations  and  predictions 
for  the  five-point  criteria  hits  (Fig.  12a).  As  mentioned 
above,  the  strict  criteria  selected  only  those  events 
where  the  forecasts  agreed  closely  with  the  observa¬ 


tions.  In  Fig.  12b,  the  composite  of  the  five-point  false 
alarms"^  displays  considerably  less  agreement  between 
the  predicted  and  observed  event  frequencies.  However, 
a  coherent  region  of  observed  events  is  evident  to  the 
north  and  east  of  the  forecasts.  These  observations  result 
from  a  subset  of  shifted  forecasts  that  were  still  useful 
despite  having  failed  the  small-scale  quality  criteria.  In¬ 
creasing  the  box  scale  to  25  points  (Fig.  12c)  shifted  these 
forecasts  to  the  higher-quality  bin.  The  resulting  fre¬ 
quency  distributions  are  similar  to  the  original  composite 
(Fig.  10).  In  fact,  observed  event  frequencies  were  close 
to  zero  over  most  of  the  low-quality  composite  (Fig.  12d), 
indicating  that  these  forecasts  were  truly  false  alarms. 
The  FSS  frequency  distributions  for  the  observation-based 
composite  (not  shown)  display  similar  trends  to  those  de¬ 
picted  in  Fig.  12.  Although  a  greater  number  of  these  events 
met  the  five-point  FSS  criteria,  a  subset  of  phase-shifted 
forecasts  was  still  evident  in  the  five-point  low-quality 
forecasts.  Similarly,  the  25-point  low-quality  forecasts 
depicted  very  little  agreement  between  the  observed 
and  predicted  frequency  distributions.  Forecasts  that 
fail  the  25 -point  criteria  likely  contain  significant  error. 

A  separate  composite  study  using  large  events  with 
sizes  ranging  from  601  to  3000  contiguous  grid  points 
with  LWP  values  at  or  above  the  500  g  depicts 
much  improved  performance  (Fig.  13).  For  both  the 
5-  and  25-point  FSS  criteria,  over  80%  of  the  forecasts 
were  binned  as  high  quality  at  all  lead  times  for  the 
observation-based  composite.  The  forecast-based  rates 
were  somewhat  lower.  The  differences  between  the 
5-  and  25-point  acceptance  rates  were  less  than  those 
for  the  small  events,  primarily  because  of  the  increased 


^  Because  this  composite  is  based  on  the  occurrence  of  a  forecast 
event,  a  low-quality  forecast  will  most  likely  be  a  false  alarm. 
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Fig.  12.  Forecast-based  composites  of  hits  and  false  alarms  as  determined  by  the  (a),(b) 
5-  and  (c),(d)  25-point  FSS  criteria  applied  to  the  21-h  mesoscale  event  forecasts.  Observed 
occurrence  frequencies  of  LWP  values  >500  g  m“^  are  contoured  while  predicted  frequencies 
are  shaded,  (left)  Hits  and  (right)  false  alarms. 


event  size  and  attendant  increase  in  coverage  near  the 
event  center.  Notably,  displacement  errors  were  much 
less  evident,  with  no  statistically  significant  trends.  At 
this  scale,  the  event  shapes  were  quite  complex,  result¬ 
ing  in  increased  variability  near  the  edges.  Also,  fewer 
events  were  sampled,  with  sample  sizes  ranging  from 
38  to  53  depending  on  the  lead  time.  Large  sample  sizes 
may  be  necessary  to  determine  if  no  systematic  shifts 
truly  exist  at  this  scale.  Given  these  caveats,  the  primary 
source  of  error  for  the  large-scale  events  was  a  tendency 
for  the  model  to  predict  too  much  cloud  cover.  The 
resulting  false  alarm  tendency  was  responsible  for  the 
reduced  quality  in  the  forecast-based  composite  (Fig.  13). 

4.  Conclusions 

In  this  study,  the  performance  of  the  CO  AMPS  deep 
cloud  forecasts  was  evaluated  using  observed  GOES 


LWP  retrievals  as  ground  truth.  Manual  inspections  of 
the  LWP  observations  in  conjunction  with  visible  and 
infrared  satellite  imagery  indicate  a  strong  likelihood 
that  deep  cloud  systems  were  well  represented  by  the 
retrievals.  However,  some  care  should  be  taken  when 
interpreting  forecast  quality  from  these  results  as  the 
satellite  observations  are  subject  to  considerable  vari¬ 
ability  that  is  difficult  to  estimate.  The  greatest  errors 
resulted  from  an  underestimate  of  cloud  depth  at  values 
above  800  g  m“^.  Presenting  the  results  in  a  bias-corrected 
form  helped  alleviate  these  issues. 

The  deep  cloud  forecasts  evaluated  during  this  period 
displayed  sufficient  accuracy  with  respect  to  the  satel¬ 
lite  observations  to  be  considered  quite  useful.  The  ma¬ 
jority  of  the  synoptic-scale  systems  (>1000  km)  were 
well  simulated  with  the  primary  error  being  an  over 
abundance  of  deep  cloudiness  in  the  forecasts.  At 
smaller  scales,  the  model  failed  to  capture  the  variability 
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Fig.  13.  As  in  Fig.  11,  but  for  synoptic-scale  events  with  sizes  ranging  from  601  to 
3000  contiguous  grid  points  with  LWP  >  500  g  m~^. 


depicted  in  the  observations.  Observed  synoptic  sys¬ 
tems  often  consisted  of  several  closely  associated  meso- 
alpha-scale  cloud  masses  that  the  model  tended  to 
depict  as  contiguous  areas.  About  40% -50%  of  the 
subsynoptic  event  predictions  exceeded  the  5x5  point 
FSS-based  quality  standard,  displaying  little  or  no  phase 
error.  Another  20% -30%  of  the  subsynoptic  forecasts 
contained  more  moderate  errors,  including  phase  er¬ 
rors,  but  still  passed  a  25  X  25  point  FSS  quality  stan¬ 
dard.  The  remaining  forecasts  contained  significant 
errors.  Lag  correlation  calculations  indicate  the  spatial 
errors  translate  to  a  slow  timing  bias  of  approximately 
3  h.  In  general,  the  model  was  able  to  discern  between 
quiescent  and  cloudy  periods  over  the  scale  of  the  entire 
region  with  correlation  coefficients  of  0.82,  0.87,  0.80, 
and  0.80  at  lead  times  of  9,  21,  33,  and  45  h.  Clear-sky 
coverage  was  overestimated  by  the  model,  though  many 
of  the  overestimates  occurred  in  regions  that  were  partly 
cloudy  in  the  observations.  When  all  observed  cells 
containing  less  than  50%  clouds  counted  were  assigned 
clear  values,  the  underestimate  was  alleviated.  Such  a 
discrepancy  likely  results  from  the  lack  of  subgrid-scale 
cloudiness.  The  forecast  skill  as  defined  by  the  correla¬ 
tions  with  the  observations  indicates  that  the  model  beat 
persistence  at  lead  times  beyond  and  including  9  h. 
Considering  that  these  forecasts  were  located  in  an 
oceanic  region  with  relatively  few  observations,  the  re¬ 
sults  of  this  evaluation  are  quite  encouraging. 

Future  work  in  this  area  will  focus  on  improving  the 
cloud  forecasts.  Additional  studies  are  planned  to  de¬ 
termine  the  effects  of  horizontal  and  vertical  grid  res¬ 
olution  as  well  as  adjustments  to  the  microphysics  and 
turbulence  schemes.  Verification  of  cloud  layer  and 
cloud- top  characteristics  are  also  planned.  New  instru¬ 
ments  such  as  CloudSat  (more  information  is  available 
online  at  http://cloudsat.atmos.colostate.edu),  the  Cloud- 
Aerosol  Lidar  and  Infrared  Pathfinder  Satellite  Obser¬ 
vation  (CALIPSO,  more  information  is  available  online 
at  http://www-calipso.larc.nasa.gov),  and  the  Atmospheric 


Infrared  Sounder  (AIRS,  more  information  is  available 
online  at  http://airs.jpl.nasa.gov)  offer  additional  infor¬ 
mation  regarding  cloud  depth  and  cloud  height  that  can 
be  used  to  refine  current  estimates.  Research  is  cur¬ 
rently  being  conducted  toward  this  end. 
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